Setting the Environment

The task specification requires developing the solution in Shiny using shiny.semantic 4.0.0. We will use data.table for data I/O and data wrangling, since it excels in speed and memory footprint (see the automated benchmarks from H2O). Also, all data used by the Shiny server will be saved in fst format, one of the fastest R binary formats.

library(data.table, warn.conflicts = FALSE)
library(magrittr) 
library(googledrive)
library(ggplot2)
library(fst)
library(geodist)

Exploratory data analysis

In this section we download the data with non-interactive authorization, retaining controlled access to the Google Drive repository and ensuring reproducibility throughout the process. Then the data are loaded for exploratory data analysis.

Calculations are performed with assignment by reference using the := operator of the data.table extension of data.frame. We will use this technique extensively throughout the computations due to its optimized performance, since it performs minimal copies in memory. Also, data wrangling is performed within the data.table environment to take advantage of the optimised backend. We carefully use vectorised operations, avoiding for loops; this is an essential technique in high-level programming languages since it reduces function calls, which are costly.
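As a minimal illustration of assignment by reference and vectorised computation (a toy table with hypothetical column names, not part of the dataset):

```r
library(data.table)

dt <- data.table(speed = c(99L, 100L, 102L), length = c(100L, 100L, 100L))

# := adds the column in place, without copying the whole table
dt[, speed_knots := speed / 10]

# vectorised computation over all rows at once -- no for loop needed
dt[, ratio := speed_knots / length]
```

Because := modifies `dt` in place, no `dt <- ...` reassignment is needed.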

Download Data

We will set up a process that requires no user interaction. This ensures reproducibility, as the document will render with no intervention, and is key to maintainability, continuous integration, development and testing. We will also get familiar with integrating Google Cloud services into data pipelines using a service account.

To ensure no human involvement, the most appropriate token to use is a service account token. Broadly speaking, there are two steps involved in the process:

Step 1: Get a service account and then download a token in json format.

Step 2: Call the auth function proactively and provide the path to your service account token.

Concerning the first step, a Google account is needed. The service account is defined through a Google Cloud Platform project (see Documentation). The process is as follows:

  1. Navigate from the Developers Console to your service account.

  2. Follow the steps to define the service account.

  3. Click “Create key” and download it as JSON. This is the service account token.

Through the service account, the computing session has limited visibility into Google Drive: only files shared with the service account can be accessed. Since I am not an owner of the shared dataset, I created a copy on my drive and shared it via the email of the service account.

Notice: in a CI/CD setting where the process is committed and pushed, the .json token should be encrypted.

The following screenshot shows the definition of the service account. The developer should further specify that it is intended for Google Drive API usage.

“Service Account screenshot”

dir.create("dataset/", showWarnings = FALSE)
options(gargle_oauth_email = TRUE)
drive_auth(path = ".buoyant-purpose-232220-c0625ed45e4f.json", email = TRUE)
## → Using an auto-discovered, cached token
##   To suppress this message, modify your code or options to clearly consent to the use of
##   a cached token
##   See gargle's "Non-interactive auth" vignette for more details:
##   <https://gargle.r-lib.org/articles/non-interactive-auth.html>
## → The googledrive package is using a cached token for 'akourets@gmail.com'
drive_download("ships_04112020.zip", path = "dataset/ships.zip", overwrite = TRUE)
## File downloaded:
##   * ships_04112020.zip
## Saved locally as:
##   * dataset/ships.zip

Load Data

ships_dt <- fread('unzip -p dataset/ships.zip', stringsAsFactors = TRUE)
str(ships_dt)
## Classes 'data.table' and 'data.frame':   3102887 obs. of  20 variables:
##  $ LAT        : num  54.8 54.8 54.8 54.8 54.7 ...
##  $ LON        : num  19 19 19 19 19 ...
##  $ SPEED      : int  99 100 102 102 102 102 101 101 102 104 ...
##  $ COURSE     : int  200 200 196 198 196 198 197 194 199 199 ...
##  $ HEADING    : int  196 196 196 196 195 196 196 196 196 197 ...
##  $ DESTINATION: Factor w/ 646 levels "0046705808","3.E MAN",..: 151 151 151 151 151 151 151 151 151 151 ...
##  $ FLAG       : Factor w/ 45 levels "--","AG","BB",..: 31 31 31 31 31 31 31 31 31 31 ...
##  $ LENGTH     : int  100 100 100 100 100 100 100 100 100 100 ...
##  $ SHIPNAME   : Factor w/ 1187 levels ". PRINCE OF WAVES",..: 503 503 503 503 503 503 503 503 503 503 ...
##  $ SHIPTYPE   : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ SHIP_ID    :integer64 2764 2764 2764 2764 2764 2764 2764 2764 ... 
##  $ WIDTH      : int  14 14 14 14 14 14 14 14 14 14 ...
##  $ DWT        : int  5727 5727 5727 5727 5727 5727 5727 5727 5727 5727 ...
##  $ DATETIME   : POSIXct, format: "2016-12-19 11:29:01" "2016-12-19 11:31:02" ...
##  $ PORT       : Factor w/ 6 levels "gdansk","gdynia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ date       : IDate, format: "2016-12-19" "2016-12-19" ...
##  $ week_nb    : int  51 51 51 51 51 51 51 51 51 51 ...
##  $ ship_type  : Factor w/ 9 levels "Cargo","Fishing",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ port       : Factor w/ 6 levels "Gdańsk","Gdynia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ is_parked  : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

We observe that the dataset is relatively “healthy”: there are some missing values, but the column names are descriptive. There are ~3 million observations.

Tidy data

We can bring the data into a more “tidy” format by extracting a vessels table. In that sense, each row is an observation, each column is a variable, and each type of observational unit is a table.

In the vessels table we will store the columns that are invariant with respect to geolocation and time.

vessel_vars <- c("SHIPNAME", "LENGTH", "SHIPTYPE", "WIDTH", "FLAG", "ship_type")
vessels_dt <- ships_dt[, c("SHIP_ID", vessel_vars), with = FALSE] %>%
  unique
ships_dt <- ships_dt[, -vessel_vars, with = FALSE]
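As a side note, columns can also be dropped by reference with `(vars) := NULL`, consistent with the assignment-by-reference approach discussed earlier; it avoids the copy that column subsetting makes. A toy sketch with hypothetical data:

```r
library(data.table)

dt <- data.table(SHIP_ID = 1:3, LENGTH = c(100L, 89L, 162L), LAT = c(54.8, 54.7, 54.6))
drop_vars <- c("LENGTH")

# (vars) := NULL deletes the columns in place, so no new table is allocated
dt[, (drop_vars) := NULL]
```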

Inspecting the duplicated cases,

dup_ids <- vessels_dt[, table(SHIP_ID) > 1] %>%
  which %>%
  names %>% 
  as.integer

vessels_dt[SHIP_ID %in% dup_ids][order(SHIP_ID)] %>%
  head
##    SHIP_ID           SHIPNAME LENGTH SHIPTYPE WIDTH FLAG ship_type
## 1:  158315 R 355 TRINE LOUISE     20        2     6   DK   Fishing
## 2:  158315 R 355 TRINE LOUISE     19        2     6   DK   Fishing
## 3:  315731               ODYS     35        3     9   PL       Tug
## 4:  315731               BBAS     35        3     9   PL       Tug
## 5:  315950           .WLA-311     24        2     6   PL   Fishing
## 6:  315950            WLA-311     24        2     6   PL   Fishing

There is some noise in the data, and the records for each duplicated id look very similar.

vessels_dt <- vessels_dt[, head(.SD, 1), by = "SHIP_ID"]
testthat::expect_false(any(vessels_dt[, table(SHIP_ID) > 1]))
vessels_dt <- vessels_dt[SHIPNAME != "[SAT-AIS]"]
vessels_dt %>%
  head
##    SHIP_ID  SHIPNAME LENGTH SHIPTYPE WIDTH FLAG ship_type
## 1:    2764    KAROLI    100        7    14   MT     Cargo
## 2:    3338     KERLI     89        7    13   MT     Cargo
## 3:    3615 FINNKRAFT    162        7    20   FI     Cargo
## 4:    5628  FINNPULP    187        7    26   FI     Cargo
## 5:    6223      MERI    105        7    18   FI     Cargo
## 6:    6333   BALTICO    169        8    24   SE    Tanker
DT::datatable(vessels_dt)

Longest distance calculation

We will pre-compute for each vessel the longest distance traveled between two consecutive observations.

ships_dt <- ships_dt[order(SHIP_ID, DATETIME)]
ships_dt[, `:=`(fromLAT = shift(LAT), fromLON = shift(LON), fromTIME = shift(DATETIME)), by = "SHIP_ID"]
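To see how shift() behaves within groups, here is a toy sketch (hypothetical coordinates): the lag is computed per SHIP_ID, so the first row of each vessel gets NA and positions never leak across vessels.

```r
library(data.table)

dt <- data.table(
  SHIP_ID = c(1L, 1L, 1L, 2L, 2L),
  LAT     = c(54.8, 54.7, 54.6, 55.0, 55.1)
)

# shift() lags LAT within each SHIP_ID group; the first row of
# every group has no predecessor and therefore becomes NA
dt[, fromLAT := shift(LAT), by = "SHIP_ID"]
```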

For the geodistance calculation we will use the lightweight and efficient geodist library with the Karney geodesic distance (“Algorithms for geodesics”, J Geod 87:43–55), since it remains accurate even for large distances.

ships_dt[, DIST := geodist(cbind(LON, LAT), cbind(fromLON, fromLAT), paired = TRUE, measure = "geodesic")]

ships_long_dist <- ships_dt[
  !is.na(DIST)][
    # sort by DIST (then DATETIME) so the row with the maximal distance,
    # most recent on ties, comes last within each ship
    order(SHIP_ID, DIST, DATETIME)][
      , tail(.SD, 1), by = "SHIP_ID"][
        , -c("date", "week_nb")]
vessels_dt <- merge(vessels_dt, ships_long_dist, by = "SHIP_ID", all.x = TRUE)
DT::datatable(
  vessels_dt,
  extensions = 'Buttons',
  options = list(
    scrollX = TRUE,
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
  )
)

Save data

Saving vessels data

fwrite(vessels_dt, "vessels.csv")
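Since the introduction states that the Shiny server will read its data in fst format, the table can additionally be serialised with write_fst. A sketch using a toy stand-in for vessels_dt (the real object is written the same way):

```r
library(fst)
library(data.table)

# toy stand-in for vessels_dt
vessels_dt <- data.table(SHIP_ID = 1:2, SHIPNAME = c("KAROLI", "KERLI"))

# compress = 50 is the default trade-off between file size and speed
write_fst(vessels_dt, "vessels.fst", compress = 50)

# the Shiny server can then load it back near-instantly
check <- read_fst("vessels.fst", as.data.table = TRUE)
```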

Download data

Data generated from this report: